Robust meta-analysis for large-scale genomic experiments based on an empirical approach

Background Recent high-throughput technologies have opened avenues for simultaneous analyses of thousands of genes. With the availability of a multitude of public databases, one can easily access multiple genomic study results where each study comprises of significance testing results of thousands of genes. Researchers currently tend to combine this genomic information from these multiple studies in the form of a meta-analysis. As the number of genes involved is very large, the classical meta-analysis approaches need to be updated to acknowledge this large-scale aspect of the data. Methods In this article, we discuss how application of standard theoretical null distributional assumptions of the classical meta-analysis methods, such as Fisher’s p-value combination and Stouffer’s Z, can lead to incorrect significant testing results, and we propose a robust meta-analysis method that empirically modifies the individual test statistics and p-values before combining them. Results Our proposed meta-analysis method performs best in significance testing among several meta-analysis approaches, especially in presence of hidden confounders, as shown through a wide variety of simulation studies and real genomic data analysis. Conclusion The proposed meta-analysis method produces superior meta-analysis results compared to the standard p-value combination approaches for large-scale simultaneous testing in genomic experiments. This is particularly useful in studies with large number of genes where the standard meta-analysis approaches can result in gross false discoveries due to the presence of unobserved confounding variables. Supplementary Information The online version contains supplementary material available at 10.1186/s12874-022-01530-y.


Background
In genomic experiments and association studies, metaanalysis is a popular tool for pooling results from multiple experiments and research studies to reach an overall decision. Due to the rapid progress in technology, there has been major development of high-throughput genomic assays. It is now possible to analyze hundreds or thousands of genes at the same time. Thus, the paradigm of simultaneous inference has transformed a lot over the past few years. Moreover, huge number of available datasets in public repositories and databases have enabled researchers to assimilate large-scale genomic information from multiple studies in the form of meta-analysis [1][2][3]. Since the sample sizes of individual genomic experiments are generally small compared to the number of genes resulting in loss of power of statistical detection after adjusting for multiple testing, meta-analysis of multiple genomic experiments has been recognized as the appropriate method in order to achieve adequate sample sizes and optimal power for statistical detection [4,5]. Meta-analysis has also gained popularity as a powerful tool for combining results from multiple genome-wide association studies [6,7]. However, current meta-analysis approaches cannot accommodate the new large-scale aspect of the underlying inference of genomic experiments. The traditional metaanalysis methods, initially developed for combining results of significance testing from experiments involving only a few candidate genes, are still being applied to current large-scale experiments involving thousands of genes [8][9][10]. There are two main approaches for classical meta-analysis methods [11]. The first approach is to combine p-values of significance testing from multiple studies, and the second approach is to combine the model-based effect sizes from different studies. While both the approaches have their own advantages and disadvantages, the p-value combination methods are more flexible as they require less assumptions from the component studies and allows results from the component studies to be combined even when the individual effect sizes and standard errors are unavailable or are in different units. Some classical p-value combination methods include Fisher's p-value combination [12], Stouffer's Z-test [13], and the weighted variations of these methods [14]. In this paper we will focus on the meta-analysis methods of p-value combination.
One of the main assumptions of the classical p-value combination methods is that for a given gene, the p-values obtained from the component studies are individually uniformly distributed under the null hypothesis. However, as pointed out by Efron [15], in large-scale multiple testing problems, the p-values may not be uniformly distributed. Consequently, this raises questions on the validity of the distributional assumptions of p-value combined test statistics of the classical approaches of Fisher [12], Stouffer [13] and their variants. To apply these classical p-value combination approaches to large-scale significance testing, one needs to ensure that all the p-values obtained from the individual studies are uniformly distributed. This is important in such large-scale hypothesis testing frameworks since the aims of these experiments differ from that of a traditional single hypothesis testing scenario. In a single hypothesis test, one aims to reject an uninteresting null hypothesis in favor of some interesting alternative hypothesis with high power, e.g., 90%. However, in a large-scale genomic experiment, the number of hypotheses can easily be as large as 10,000 because of the same number of genes involved. In that case, the aim is to identify a small subset of genes, usually much less than 10% of the total number of genes, which are of most significance and are carried forward for further investigation. Thus, it is not expected from a large-scale multiple hypotheses framework to reject 90% of the 10,000 null hypotheses involved, unlike that of a single hypothesis framework. Efron [15] points out that the advantage of having thousands of null hypotheses in place of a single null hypothesis is that one can estimate the null distribution empirically and do not need to carry out testing based on some theoretical asymptotic null distribution. Empirical null distribution can be very useful in large observational studies since, unlike the theoretical null, it can take into account the additional variation and moderate bias caused by some unobserved variables (e.g., batch effects [16] or unmeasured confounder effects). Moreover, the problems caused by ignoring the effects of unobserved variables or potential confounders, and relying on theoretical null distribution for testing, can be aggravated in meta-analysis of large-scale genomic studies as discussed in Sikdar et al. [17]. In such a situation, even though a meta-analysis method has high power, it can lead to gross false discoveries of significant genes even after applying standard multiple testing correction techniques [17]. Therefore, in order to reduce the false discovery rate, it is essential to build meta-analysis methods involving large-scale hypothesis testing that are based on empirically adjusted null distribution rather than a theoretically assumed null distribution.
The idea of drawing inference based on an empirical null distribution, instead of a theoretical null distribution, was recently adopted in the context of meta-analysis by Sikdar et al. [17] and later applied in You et al. [18]. The meta-analysis method of Sikdar et al. [17], known as EAMA, modifies the classical Fisher's p-value combination method for large-scale genomic studies by empirically adjusting the null distribution through an empirical Bayes framework where the amount of adjustment depends on the extent of discrepancy between the empirical and the theoretical null distributions. However, EAMA was only limited to the classical unweighted (equally weighted) Fisher's p-value combination method which can perform poorly when there are large variations among the component studies of meta-analysis. Moreover, its performance can become unstable in certain situations as shown through simulation studies in a later section. In this article we propose a meta-analysis method for large-scale genomic experiments that implements a weighted p-value combination approach while estimating the empirical null distribution parameters through a recently developed Bayesian approach by van Iterson et al. [19] as opposed to the empirical Bayes approach of EAMA. Through a variety of simulated scenarios, we show that our proposed empirical null adjusted meta-analysis method has robust performances and works best in reducing false discoveries among several competing approaches for large-scale genomic meta-analysis especially in the presence of hidden confounders. Moreover, we demonstrate the utility of the proposed meta-analysis approach through a meta-analysis of lung cancer genomic studies.
The rest of the paper is organized as follows. In the Methods section, we describe the popular p-value combination approaches, methods for empirical null estimation, and our proposed combination of empirical null adjusted meta-analysis of large-scale simultaneous significance testing. In the Results section, we construct various simulation settings to compare the performances of our proposed meta-analysis approach with that of the other competing approaches. We also illustrate our approach through an application on a set of lung cancer genomic studies. The article ends with a discussion and a conclusion section.

Meta-analysis using weighted Z-scores
Suppose there are K independent studies and G genes in each study. Here the idea is to detect the genes that are related to the outcome of interest based on the results from the K independent studies. In other words, for each gene j, we want to test the overall null hypothesis H j : gene j does not contribute to the outcome of interest across all K independent studies, j = 1, 2, …, G. The general principle in the meta-analysis framework is to combine the results for each gene across the K independent studies to reach an overall decision for that gene.
In this section, we focus on a widely used weighted Z-score meta-analysis method based on p-values from independent studies [14], which is defined as follows: Suppose N i denotes the sample size of study i, i = 1, 2, …, K. Let Δ ij and p ij denote the direction of effect and the p-value for gene j from study i, respectively, i = 1, 2, …, K; j = 1, 2, …, G. The weighted Z-score meta-analysis method converts the direction of effect and p-value observed in each study for each gene into a signed Z-score, which is defined as The signed Z-scores, for each gene, are then combined across studies in a weighted sum where the weights are proportional to the square-root of the sample size for each study [13,14]. That is, for a gene j, the overall Z-score is defined as where w i = √ N i ; i = 1, 2, …, K, j = 1, 2, …, G. Finally, an overall p-value for the gene j is obtained as

Method for empirical estimation of null distribution
Suppose there are G null hypotheses (for example, corresponding to G genes) in a single study. Let the p-values corresponding to the null hypotheses be denoted as p 1 , p 2 , …, p G . Each p-value in the study can be converted into z-score as z j = Φ −1 ( p j ), j = 1, 2, …, G. Theoretically, the null distribution of z j is N(0, 1), j = 1, 2, …, G.
However, the large-scale multiple testing situations enable us to estimate the null distribution of z j , j = 1, 2, …, G. In this section, we will discuss a Bayesian approach, named BACON [19], for estimating the null distribution empirically.
BACON assumes that the z-scores can be modeled by a three-component normal mixture, where one of the components represents the empirical null distribution and the other two components represent two separate nonnull distributions. Here, the z-scores close to the central peak of the histogram are assumed to be generated from the null distribution, whereas those towards the left and right tails of the histogram are generated from the two different alternative distributions. The three-component normal mixture model is defined as follows: where 3 k=1 p k = 1 and φ(z, μ k , σ k ) represent the density of N(μ k , σ k 2 ), k = 1, 2, 3. This method uses a Gibbs sampling scheme to estimate the parameters of the mixture distribution, assuming conjugate prior distributions for the parameters as given below: We considered the same hyper-priors as suggested by van Iterson et al. [19]. The initial values are considered based on the median and median absolute deviation of the test statistics [19].
At each iteration the Gibbs sampling algorithm comprises of the following steps given that we have G genes resulting in G z-scores of the form z j (j = 1, 2, …, G) and the associated outcome values y j (j = 1, 2, …, G): 1) The unobserved data is generated as: 22:43 and ∼ π jk is the normalized proportion so that The following quantities are calculated: 3) Samples are generated from the posterior distributions as follows: A total of 5000 iterations and a burn-in period of 2000 iterations are recommended.

Proposed meta-analysis method based on empirically modified weighted Z-scores
In this section, we describe our proposed approach of an empirically adjusted meta-analysis that combines appropriately weighted modified z-values and computes multiple testing corrected p-values where the modification involves transforming the raw z-values through an empirical correction of the null distribution. Following are the detailed steps of our proposed meta-analysis method.
Considering K independent studies and G genes in each study, let z ij denotes the signed z-score, obtained through transformation from p-value p ij and direction of the effect estimates ∆ ij as z ij = −1 1 − p ij 2 × ij , i = 1, 2, …, K; j = 1, 2, …, G, as defined in the methods section. Since, these z-scores z ij may not follow N(0, 1) under the null hypotheses, we empirically estimate the parameters of the null distribution of the z-scores using BACON. Let f 0 μ B ,σ 2 B denote the BACON estimated null distribution of the z-scores. Using the estimated null density, we define z ij = as the modified z-score for gene j from study i, i = 1, 2, …, K; j = 1, 2, …, G. The modified z-scores z ij s, expected to be standard normally distributed under the null hypotheses, are then meta-analyzed using the weighted Z-score method as follows: The final p-values, P j s, are corrected for multiple testing using the Benjamini-Hochberg (BH) method [20].

Alternative choices for empirical null adjusted p-value combinations
In this section we discuss Fisher's p-value combination, a popular alternative to the weighted Z-score combination, which directly combines the p-values instead of transforming them into z-values. In addition, we briefly discuss another potential choice for computing empirical null distribution through an empirical Bayes method that was first proposed by Efron [15] and subsequently adopted for meta-analysis in EAMA [17]. The reason for our discussion of these methods is that one can potentially combine any of the two p-value/z-value combination approaches with any of the two empirical null computation algorithms and each such combination leads to a different empirical adjusted meta-analysis. We compare the performances of each such combination to our proposed meta-analysis method in our simulations. We briefly discuss the Fisher's p-value combination method [12] and the empirical Bayes method for estimating null distribution [15] as follows.

Fisher's p-value combination
Fisher's method combines p-values across independent studies giving equal weights to all studies [12]. Assuming K independent studies and G genes in each study, for gene j, the test statistic for the Fisher's method is defined as Under the null hypothesis that gene j does not contribute to the outcome of interest, the test statistic F j follows a χ 2 distribution with 2K degrees of freedom, assuming that the p-values p ij s are independently uniformly distributed on the interval [0, 1] for each j; i = 1, 2, …, K; j = 1, 2, …, G.

Empirical Bayes method for estimating null distribution
Efron [15] used an empirical Bayes model for estimating the null distribution of the z-scores. The z-scores for the genes are classified into two classes -"uninteresting" if z is generated from the null distribution, and "interesting" if z is generated from the non-null distribution with respective densities f 0 (z) and f 1 (z). Also, let the prior probabilities of z belonging to the "uninteresting" or "interesting" classes be denoted as p 0 and p 1 = 1 − p 0 , respectively. The mixture density of the z-scores can be defined as f(z) = p 0 f 0 (z) + p 1 f 1 (z).
Following Bayes theorem, the a posteriori probability of belonging to the "uninteresting" class given z can be defined as The aim is to estimate the null density, f 0 , from the central peak of the histogram of the z-scores. Assuming the null density, f 0 is N(δ 0 , σ 0 2 ), where the mean δ 0 is not necessarily 0 and standard deviation σ 0 is not necessarily 1, for all z-scores close to 0, we can write The parameter δ 0 can be estimated as argmax(f(z)) and . However, the estimate of σ 0 obtained by directly differentiating the spline estimate of log(f(z)) can be unstable. Therefore, one more smoothing step is applied where a quadratic curve, a 0 + a 1 x k + a 2 x 2 k is fitted by ordinary least squares to the estimated log(f k ) values, for x k within 1.5 units of the maximum δ 0 , which yields σ 0 = [−2a 2 ] − 1 2 . This approach of estimating the null distribution is called "central-matching" approach. More details about this approach can be found in Efron [15] and Efron [21]. This empirical Bayes method of estimating null distribution is referred to as EB method from now on.
Note that, incorporating EB adjustment into Fisher's p-value combination approach leads to the previously mentioned EAMA method [17]. Following this approach, one can also apply EB adjustments to the weighted Z-scores method as well as BACON adjustments to Fisher's p-value combination where each combination gives rise to a different meta-analysis method. The last two meta-analysis methods, namely, EB adjusted weighted Z-score and BACON adjusted Fisher, are also new and have not been explored before in the literature. In this article we are implementing them for the first time and will explore their performances as competing candidates to our proposed meta-analysis method in various simulation settings in the next section. We will also compare the performance of EAMA in that section.

Simulation studies
To evaluate the performance of our proposed method, we simulated continuous gene expression datasets for multiple independent experiments. We considered three simulation settings -setting 1, setting 2, and setting 3. In setting 1 we assumed there exists no hidden variable or confounder in the data. For setting 2, we assumed presence of a hidden variable which acts as a confounder, and in setting 3 we assumed presence of a hidden variable which does not act as a confounder. Details of the data generation method are described below.
We considered 10 independent experiments, i.e. K = 10 and two groups of subjects. The total number of genes in each experiment was 10,000, i.e. G = 10,000, out of which 1000 genes were assumed to be differentially expressed between the two subject groups. The log expression value, Y jlm , for the gene j, subject m in group l was generated using a linear model as given below.
where n l denotes the number of subjects in each group ,l = 1, 2. Here, μ denotes the general mean effect, α j denotes the effect due to the gene j, β l denotes the effect due to the group l, (αβ) jl denotes the interaction effect between the gene j and group l, γ jlm denotes the effect of a hidden variable or confounder, which remains unaccounted during an analysis, on the gene j, subject m in the group l, while e jlm denotes the error term.
With the above choices of the parameters of the linear model, we generated datasets for the following three simulation settings: Setting 1: In this setting, we assumed that there does not exist any effect of hidden variable or confounder. So, γ jlm = 0 for all j, l, m. Setting 2: In this setting, we assumed that there exists an effect of hidden variable which acts as a confounder. Here, we generated γ jlm as γ jlm = u jlm I(s jlm = 1), where s jlm were generated from Bernoulli(0.4) and u jlm were generated depending Y jlm = + j + l + ( ) jl + jlm + e jlm ; j = 1, 2, … , G, l = 1, 2, m = 1, 2, … , n l jlm otherwise j = 1, 2, … , G, l = 1, 2, m = 1, 2, … , n l on the gene, subject and also the experiment as follows: Here, the effect of the hidden confounder varied between the two groups of subjects, according to the different groups of genes and over different experiments.
Setting 3: In this setting, we assumed that there exists an effect of hidden variable which does not act as confounder. Therefore, we considered a simulation setting where the distribution of the hidden variable does not differ between the two subject groups. We generated γ jlm as γ jlm = u jlm I(s jlm = 1), where s jlm were generated from Bernoulli(0.4) distribution and u jlm were generated as u jlm~N (5 + i, 0.01 2 ); i = 1, 2, …, K; m = 1, 2, …, n l ; l = 1, 2.
We considered different choices for the sample sizes of the experiments and the two groups in each experiment in our simulations which will be discussed in the later sections.
After generating the data for the three simulation settings in each experiment, we used the 'limma' package in Bioconductor for testing for differential expression for the genes between the two subject groups [22]. The raw p-value and direction of effect for each gene, obtained from 'limma' , were stored. We applied our proposed method (BACON-adjusted weighted Z-score) to identify the significant set of genes. For comparison, we applied EB adjusted weighted Z-score method, EAMA, BACONadjusted Fisher method, along with standalone Fisher's method and weighted Z-score method without any empirical adjustments to identify the significant genes. A gene is identified as differentially expressed if the BH adjusted p-value is less than 0.05.
The performance of our proposed method and all the other methods in comparison were assessed using four measures, namely, sensitivity, specificity, false discovery rate (FDR) and false non-discovery rate (FNR) based on 500 independent Monte-Carlo iterations. We compared the performances of all the methods mentioned above under the following simulation scenarios for the three settings:

Unequal sample sizes of the two subject groups
When the number of samples in the two groups were unequal, we considered the total effective sample size for . In this simulation scenario, we considered n 1 = 30 and n 2 = 70 in each experiment. Therefore, the total effective sample size for experiment i is N i = 4 1 n 1 + 1 n 2 = 84 , i = 1, 2, …, K. Table 1 shows the FDR values for our proposed method and all the other methods in comparison, for the three simulation settings. We observe that in setting 1, where there exists no hidden variable or confounder, all methods, including our proposed method, have reasonably small FDR values. But in setting 2, where there exists a hidden effect of a confounder, the Fisher's and weighted Z-score methods without any empirical adjustments perform very poorly with extremely high FDR values. In the presence of a hidden variable which does not act as confounder in setting 3, all methods performed similarly, except EAMA which had slightly higher FDR (0.11) compared to the other methods. Figure 1 shows the sensitivity, specificity, and FNR values of our proposed method and all the other methods in comparison, for the three simulation settings. We observed that all the methods have very similar sensitivity and FNR values in all settings. The specificity values of all methods, except the Fisher's and weighted Z-score methods without any empirical adjustments, are also similar in all settings. The Fisher's and weighted Z-score methods have low specificity values in setting 2 in the presence of hidden confounder.

Varying sample sizes of experiments
In this simulation scenario, we considered varying sample sizes of the experiments. We considered N i = N i − 1 + 10, i = 2, …, K and N 1 = 80. The subjects were equally divided between the two groups. Table 2 shows the FDR values for the three simulation settings under this scenario.
The results were very similar to what we observed before with varying sample sizes in the subject groups for all methods in all three settings, where the Fisher's method and weighted Z-score method, without empirical adjustments, had very high FDR values in setting 2, and the FDR of EAMA was slightly high (0.12) in presence of a hidden variable which does not act as confounder. The sensitivity, and FNR values were similar for all methods in all settings, while the specificity values of the Fisher's method and weighted Z-score method, without empirical adjustments, were very low in setting 2 (supplementary Fig. 1).
Since in many biological experiments the sample sizes are lower, we reduced the sample sizes of the experiments and compared the performances of the methods. We considered N i = N i − 1 + 6, i = 2, …, K and N 1 = 20. The FDR values for the three settings are shown in Table 3.
In setting 1, the FDR values of all the methods, except EAMA, were similar, where EAMA tends to have slightly high value (0.09). In setting 2, the performances of the Fisher's method and the weighted Z-score method, without empirical adjustments, were consistently poor in the presence of hidden confounder. Additionally, the performance of the EB-adjusted weighted Z-score method gets worse with higher FDR (0.13). In setting 3, the FDR values of all the methods were similar. The sensitivity, specificity, and FNR values of all methods were very similar to what we observed before (supplementary Fig. 2).
Additionally, we considered a simulation scenario where different set of genes were differentially expressed across the experiments. We considered 500 genes as differentially expressed in the first five experiments and a separate set of 500 genes as differentially expressed in the remaining five experiments. This resulted in a total of 1000 genes as differentially expressed in at least one experiment. The sample sizes of the experiments were N i = N i − 1 + 6, i = 2, …, K and N 1 = 20. All the other choices of the parameters were same as before in all three settings. Supplementary table 1 shows the performances of all the methods in all three settings based on 500 Monte-Carlo iterations. We observed very similar performances of all the methods as we observed in the previous scenario with same genes as differentially expressed across experiments.

Reduced differential expression between the subject groups
In this scenario, we considered a reduced magnitude in differential expression of the genes between the two subject groups. To achieve this, the interaction terms were generated so that the absolute differences in the log expression values of the 1000 differentially expressed genes between the two groups was two and for all the remaining genes was zero. Additionally, we considered varying sample sizes of the experiments as we previously observed differences in performances of the methods under this scenario. We considered N i = N i − 1 + 10, i = 2, …, K and N 1 = 80. The results are shown in Table 4.
In setting 1, all methods perform well, except EAMA, which had slightly high FDR (0.09). Both EB adjusted weighted Z and our proposed method performed similar, however, the former had a slightly high FDR (0.06) in setting 1. In setting 2, where there exists an effect of hidden confounder, huge differences in the performances can be observed. Specifically, Fisher's method and the weighted Z-score method without any empirical adjustments had very poor performances with low sensitivity and specificity values, and high FDR and FNR values. EAMA had  Overall, summarizing from all the simulation results, we find that our proposed BACON adjusted weighted Z-score method has been the most consistent in maintaining the high levels of sensitivity and specificity while maintaining low or acceptable levels of false positive and false negative. Although EB-adjusted weighted Z is a good competitor of BACON adjusted weighted Z in terms of sensitivity, specificity, and FNR values, there exist instances in presence of hidden confounder (Table 3, and supplementary table 1) where EB-adjusted weighted Z has unacceptable FDR values that are much higher than the nominal type-I error rate.

Lung cancer data
We considered five lung cancer gene expression datasets, namely Bhattacharjee [23], GSE11969 [24], GSE29016 [25], GSE30219 [26], and GSE43580 [27]. These datasets were previously normalized and processed by Hughey JJ et al. [28] which are available at [29]. Each dataset had normalized gene expression levels for 7200 genes for participants with different types of lung cancer. We aimed to identify the set of differentially expressed genes between the participants with Adenocarcinoma (AD) and Squamous cell carcinoma (SQ). Four of the datasets (GSE11969, GSE29016, GSE30219, and GSE43580) had information on the smoking status, gender, and age of the participants. All participants with missing covariates were removed from the analysis. Table 5 shows the characteristics of the participants for the two cancer types in each dataset.
We tested for differential expression of the genes between AD and SQ participants using the 'limma' package in Bioconductor [22], adjusting for the available covariates, for each dataset separately. The raw p-values and the direction of the effects of the genes were stored for the meta-analysis. We applied our proposed meta-analysis method to identify the set of differentially expressed genes between AD and SQ lung cancer participants. The empirically estimated null distribution of the z-scores, using BACON [19], had mean − 0.34 and standard deviation (SD) 1.91. This suggests that the empirically estimated null distribution of the z-scores is much deviated from the theoretical null distribution, N(0, 1). After multiple testing correction with BH method [20],  we identified 2957 differentially expressed genes between AD and SQ participants at 5% significance level.
For comparison, we also applied the naïve weighted Z-score method as well as our previously proposed method, EAMA, to identify the set of differentially expressed genes. The naïve weighted Z-score method identified 4922 differentially expressed genes, while EAMA identified 1505 differentially expressed genes, at BH adjusted p-value cutoff of 0.05. A Venn diagram showing the overlap of the number of differentially expressed genes identified by our proposed method with the other two methods in comparison is given in Fig. 2. All the genes identified by EAMA were also identified by both our proposed method and the naïve weighted Z-score method. Additionally, all the genes identified by the proposed method were also identified by the naïve weighted Z-score method. Identification of so many differentially expressed genes by the naïve weighted Z-score method indicates possibility of high gross false discoveries. EAMA identified much lesser number of genes compared to our proposed method at the same BH adjusted p-value cutoff, which might reflect a situation where EAMA has lower sensitivity and/or high non false discovery rate, similar to setting 2 in Table 4. We also checked the performances of the methods without adjusting for the covariates in the studies, assuming they are hidden. The pattern of performances of the methods were very similar to what we observed after adjusting for the observed covariates, where the naïve weighted Z-score method identified a large number of differentially expressed genes and EAMA identified much lesser number of genes compared to our proposed method. Note that, even after adjusting for the observed covariates, there still might exist potential hidden confounders underlying these studies which remained unaccounted for in all our analyses. We proceed with the results adjusting for the covariates with the aim to account for all possible covariates effects that have been observed.
In order to identify biological pathways associated with the significant list of genes identified by all three methods, we performed functional annotation analysis using the software, called DAVID [30,31]. Some of the top pathways overrepresented in the significant list of genes include cell cycle, DNA replication, pathways in cancer, and p53 signaling pathway, which has been frequently found to be associated with lung cancer [32][33][34]. We also identified the pathways overrepresented in the significant list of genes identified only by our proposed method. Pathways related to lung cancer, such as non-small cell lung cancer and Foxo signaling pathway [35], were significantly overrepresented in our gene list.

Discussion
Meta-analysis is a popular tool for combining hypothesis testing results from multiple studies. It is extensively used in genomic studies, clinical studies, psychological studies, and other social sciences applications. The field of genomic experiments have undergone major changes in the past few years with the advent of modern highthroughput technologies. One such change is that thousands of genes can be analyzed simultaneously nowadays which, in turn, leads to simultaneous testing of thousands of hypotheses. While combining such large-scale multiple hypotheses testing results from multiple studies, the traditional meta-analysis approaches involving p-value combinations fail to make use of the large-scale aspect Fig. 2 Venn diagram showing the number of differentially expressed genes identified by the meta-analysis methods. This figure shows the overlap between the number of differentially expressed genes, significant at BH-adjusted p-value cutoff of 0.05, identified by the proposed method, the naïve weighted Z-score method and EAMA of the data. For instance, large-scale hypotheses testing allows empirical estimation of the parameters of null distributions without having to rely on some theoretically set null parameters. However, such provisions of empirically adjusting the null distributions are not accommodated by the classical p-value combination methods. A possible consequence of relying only on theoretical null distributions can be gross false discoveries and inaccurate inference from meta-analysis. As discussed in this article, this problem becomes more profound whenever there is a possibility of presence of some unobserved variables or unmeasured confounders. In this article we discussed some recent developments in estimating empirical null distributions and proposed ways for incorporating such empirical null distributions in meta-analysis of largescale genomic experiments. Finally, we proposed an empirically adjusted weighted p-value combination approach which estimates the empirical null distribution parameters through a Bayesian framework. We demonstrated its robustness and superiority over other metaanalysis approaches through a wide variety of simulation settings that mimic large-scale genomic testing experiments. We also applied our proposed method in metaanalysis of multiple lung cancer gene-expression studies to obtain biologically meaningful results. Although we mostly focused on microarray studies in this article, our proposed method can be easily applied for meta-analysis of expression data from other platforms (e.g., next-generation sequencing) or other type of genomic studies (e.g., DNA methylation, SNP data) as long as one can obtain p-values for each genomic feature from multiple studies.
The proposed method assumes a common null distribution across all studies, which is estimated empirically instead of relying on a theoretical null distribution. There exist meta-analysis methods that do not necessarily assume a common null distribution to account for between studies variability. However, such methods are primarily model-based approaches, e.g., random effects model, which is a different category of meta-analysis that requires information on the individual effect sizes and their corresponding standard errors to obtain a measure of between-study variability [36]. In many situations, the individual effect size estimates and the corresponding standard errors are not available. Therefore, in this article, we have focused on those meta-analysis methods that require only p-values from individual studies.
In this article we aimed at improving the meta-analysis method of large-scale genomic testing studies by modifying the classical p-value combination methods through empirical adjustments. These classical p-value combination methods aim to test for significance of a gene in at least one of the component studies and the method proposed in this article is based on the same principle of significance testing. There exists another approach of meta-analysis of significance testing results that focuses on testing for significance of a gene in the majority (e.g., 70%) of the component studies. There have been some recent p-value combination methods that aimed for this second type of meta-analysis [11,37]. Since these methods test hypotheses which are conceptually different from the hypotheses we are testing and have a different aim, we have not discussed them in this article. Nevertheless, the empirical adjustments, which we applied in our meta-analysis method, can also be extended to the metaanalysis methods of the second type if the main aim is to find significant genes in the majority of component studies. In future, we plan to pursue this approach of empirical adjustments to the second type of meta-analysis.

Conclusion
In this article, we have highlighted the drawbacks of the classical p-value combination methods for significance testing in large-scale genomic experiments. These classical p-value combination methods rely on a theoretical null distribution which can be different from the true null distribution especially in the presence of confounding variables in large observational studies. We have proposed a robust meta-analysis approach of p-value combination which modifies the p-values through the computation of an empirical null distribution. Our proposed meta-analysis approach can account for the effects of unobserved variables and confounders and has been shown to perform better than the classical p-value combination methods and other competing meta-analysis techniques. Overall, we believe that our proposed metaanalysis approach can help in accurate identification of truly significant genes by combining the findings of multiple large-scale genomic experiments.